1.- DEFINIR OBJETIVOS. Los objetivos de la práctica son aplicar las técnicas de análisis exploratorio de datos (EDA) para analizar los patrones presentes en un conjunto de datos relacionados con solicitudes de préstamos. Realizaremos un análisis EDA para analizar los patrones presentes en los datos, asegurando que los solicitantes capaces de pagar el préstamo no sean rechazados. Se identificarán patrones que indiquen si un cliente puede tener dificultades para devolver el préstamo, lo que puede ser utilizado para tomar acciones. Para esto cogeremos la variable objetivo "TARGET".
El trabajo se va a dividir en los mismos notebooks que los HTML de guía y se tratará desde el notebook 01_Exploración general, donde se realizan los primeros pasos del EDA como la carga del dataframe, tratamiento de variables a futuro, dimensiones del dataset..., hasta la parte de modelo (no incluida) del notebook 03_encoding_categorical_scaled_vars.
En cuanto al problema a resolver, sera poder predecir la probabilidad de que un cliente tenga dificultades para pagar un préstamo. Es un problema de clasificación binaria, donde las clases son: 0: El cliente no tuvo dificultades para pagar el préstamo. 1: El cliente tuvo dificultades para pagar el préstamo. La meta principal es desarrollar un modelo que permita anticipar clientes con alto riesgo de incumplimiento de pagos.
Las variables a utilizar las iremos viendo a través del análisis del dataset, algunas interesantes a priori podrían ser el tipo de préstamo, el trabajo que desarrollan los clientes...
En primer lugar, cargamos todas las librerías y ponemos el máximo de columnas y filas que va a mostrar asi como el ancho a mostrar del dataframe
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import plotly.io as pio
import sys
import os
pio.renderers.default = 'notebook'
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_rows', 10000)
pd.set_option('display.width', 10000)
sys.path.append(os.path.abspath('../src'))
import f_aux
Importo el diccionario de datos
dict_path = '../data/columns_description.xlsx'
dictionary = pd.read_excel(dict_path)
dictionary.head(1)
| Unnamed: 0 | Table | Row | Description | Special | |
|---|---|---|---|---|---|
| 0 | 1 | application_data | SK_ID_CURR | ID of loan in our sample | NaN |
Veo que tiene una columna creada tras la lectura del mismo como excel, procedo a eliminarla
dictionary = dictionary.drop(columns=['Unnamed: 0'])
Ahora se puede ver el total de columnas y que significa cada una de ellas
dictionary.head(122)
| Table | Row | Description | Special | |
|---|---|---|---|---|
| 0 | application_data | SK_ID_CURR | ID of loan in our sample | NaN |
| 1 | application_data | TARGET | Target variable (1 - client with payment diffi... | NaN |
| 2 | application_data | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | NaN |
| 3 | application_data | CODE_GENDER | Gender of the client | NaN |
| 4 | application_data | FLAG_OWN_CAR | Flag if the client owns a car | NaN |
| 5 | application_data | FLAG_OWN_REALTY | Flag if client owns a house or flat | NaN |
| 6 | application_data | CNT_CHILDREN | Number of children the client has | NaN |
| 7 | application_data | AMT_INCOME_TOTAL | Income of the client | NaN |
| 8 | application_data | AMT_CREDIT | Credit amount of the loan | NaN |
| 9 | application_data | AMT_ANNUITY | Loan annuity | NaN |
| 10 | application_data | AMT_GOODS_PRICE | For consumer loans it is the price of the good... | NaN |
| 11 | application_data | NAME_TYPE_SUITE | Who was accompanying client when he was applyi... | NaN |
| 12 | application_data | NAME_INCOME_TYPE | Clients income type (businessman, working, mat... | NaN |
| 13 | application_data | NAME_EDUCATION_TYPE | Level of highest education the client achieved | NaN |
| 14 | application_data | NAME_FAMILY_STATUS | Family status of the client | NaN |
| 15 | application_data | NAME_HOUSING_TYPE | What is the housing situation of the client (r... | NaN |
| 16 | application_data | REGION_POPULATION_RELATIVE | Normalized population of region where client l... | normalized |
| 17 | application_data | DAYS_BIRTH | Client's age in days at the time of application | time only relative to the application |
| 18 | application_data | DAYS_EMPLOYED | How many days before the application the perso... | time only relative to the application |
| 19 | application_data | DAYS_REGISTRATION | How many days before the application did clien... | time only relative to the application |
| 20 | application_data | DAYS_ID_PUBLISH | How many days before the application did clien... | time only relative to the application |
| 21 | application_data | OWN_CAR_AGE | Age of client's car | NaN |
| 22 | application_data | FLAG_MOBIL | Did client provide mobile phone (1=YES, 0=NO) | NaN |
| 23 | application_data | FLAG_EMP_PHONE | Did client provide work phone (1=YES, 0=NO) | NaN |
| 24 | application_data | FLAG_WORK_PHONE | Did client provide home phone (1=YES, 0=NO) | NaN |
| 25 | application_data | FLAG_CONT_MOBILE | Was mobile phone reachable (1=YES, 0=NO) | NaN |
| 26 | application_data | FLAG_PHONE | Did client provide home phone (1=YES, 0=NO) | NaN |
| 27 | application_data | FLAG_EMAIL | Did client provide email (1=YES, 0=NO) | NaN |
| 28 | application_data | OCCUPATION_TYPE | What kind of occupation does the client have | NaN |
| 29 | application_data | CNT_FAM_MEMBERS | How many family members does client have | NaN |
| 30 | application_data | REGION_RATING_CLIENT | Our rating of the region where client lives (1... | NaN |
| 31 | application_data | REGION_RATING_CLIENT_W_CITY | Our rating of the region where client lives wi... | NaN |
| 32 | application_data | WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply ... | NaN |
| 33 | application_data | HOUR_APPR_PROCESS_START | Approximately at what hour did the client appl... | rounded |
| 34 | application_data | REG_REGION_NOT_LIVE_REGION | Flag if client's permanent address does not ma... | NaN |
| 35 | application_data | REG_REGION_NOT_WORK_REGION | Flag if client's permanent address does not ma... | NaN |
| 36 | application_data | LIVE_REGION_NOT_WORK_REGION | Flag if client's contact address does not matc... | NaN |
| 37 | application_data | REG_CITY_NOT_LIVE_CITY | Flag if client's permanent address does not ma... | NaN |
| 38 | application_data | REG_CITY_NOT_WORK_CITY | Flag if client's permanent address does not ma... | NaN |
| 39 | application_data | LIVE_CITY_NOT_WORK_CITY | Flag if client's contact address does not matc... | NaN |
| 40 | application_data | ORGANIZATION_TYPE | Type of organization where client works | NaN |
| 41 | application_data | EXT_SOURCE_1 | Normalized score from external data source | normalized |
| 42 | application_data | EXT_SOURCE_2 | Normalized score from external data source | normalized |
| 43 | application_data | EXT_SOURCE_3 | Normalized score from external data source | normalized |
| 44 | application_data | APARTMENTS_AVG | Normalized information about building where th... | normalized |
| 45 | application_data | BASEMENTAREA_AVG | Normalized information about building where th... | normalized |
| 46 | application_data | YEARS_BEGINEXPLUATATION_AVG | Normalized information about building where th... | normalized |
| 47 | application_data | YEARS_BUILD_AVG | Normalized information about building where th... | normalized |
| 48 | application_data | COMMONAREA_AVG | Normalized information about building where th... | normalized |
| 49 | application_data | ELEVATORS_AVG | Normalized information about building where th... | normalized |
| 50 | application_data | ENTRANCES_AVG | Normalized information about building where th... | normalized |
| 51 | application_data | FLOORSMAX_AVG | Normalized information about building where th... | normalized |
| 52 | application_data | FLOORSMIN_AVG | Normalized information about building where th... | normalized |
| 53 | application_data | LANDAREA_AVG | Normalized information about building where th... | normalized |
| 54 | application_data | LIVINGAPARTMENTS_AVG | Normalized information about building where th... | normalized |
| 55 | application_data | LIVINGAREA_AVG | Normalized information about building where th... | normalized |
| 56 | application_data | NONLIVINGAPARTMENTS_AVG | Normalized information about building where th... | normalized |
| 57 | application_data | NONLIVINGAREA_AVG | Normalized information about building where th... | normalized |
| 58 | application_data | APARTMENTS_MODE | Normalized information about building where th... | normalized |
| 59 | application_data | BASEMENTAREA_MODE | Normalized information about building where th... | normalized |
| 60 | application_data | YEARS_BEGINEXPLUATATION_MODE | Normalized information about building where th... | normalized |
| 61 | application_data | YEARS_BUILD_MODE | Normalized information about building where th... | normalized |
| 62 | application_data | COMMONAREA_MODE | Normalized information about building where th... | normalized |
| 63 | application_data | ELEVATORS_MODE | Normalized information about building where th... | normalized |
| 64 | application_data | ENTRANCES_MODE | Normalized information about building where th... | normalized |
| 65 | application_data | FLOORSMAX_MODE | Normalized information about building where th... | normalized |
| 66 | application_data | FLOORSMIN_MODE | Normalized information about building where th... | normalized |
| 67 | application_data | LANDAREA_MODE | Normalized information about building where th... | normalized |
| 68 | application_data | LIVINGAPARTMENTS_MODE | Normalized information about building where th... | normalized |
| 69 | application_data | LIVINGAREA_MODE | Normalized information about building where th... | normalized |
| 70 | application_data | NONLIVINGAPARTMENTS_MODE | Normalized information about building where th... | normalized |
| 71 | application_data | NONLIVINGAREA_MODE | Normalized information about building where th... | normalized |
| 72 | application_data | APARTMENTS_MEDI | Normalized information about building where th... | normalized |
| 73 | application_data | BASEMENTAREA_MEDI | Normalized information about building where th... | normalized |
| 74 | application_data | YEARS_BEGINEXPLUATATION_MEDI | Normalized information about building where th... | normalized |
| 75 | application_data | YEARS_BUILD_MEDI | Normalized information about building where th... | normalized |
| 76 | application_data | COMMONAREA_MEDI | Normalized information about building where th... | normalized |
| 77 | application_data | ELEVATORS_MEDI | Normalized information about building where th... | normalized |
| 78 | application_data | ENTRANCES_MEDI | Normalized information about building where th... | normalized |
| 79 | application_data | FLOORSMAX_MEDI | Normalized information about building where th... | normalized |
| 80 | application_data | FLOORSMIN_MEDI | Normalized information about building where th... | normalized |
| 81 | application_data | LANDAREA_MEDI | Normalized information about building where th... | normalized |
| 82 | application_data | LIVINGAPARTMENTS_MEDI | Normalized information about building where th... | normalized |
| 83 | application_data | LIVINGAREA_MEDI | Normalized information about building where th... | normalized |
| 84 | application_data | NONLIVINGAPARTMENTS_MEDI | Normalized information about building where th... | normalized |
| 85 | application_data | NONLIVINGAREA_MEDI | Normalized information about building where th... | normalized |
| 86 | application_data | FONDKAPREMONT_MODE | Normalized information about building where th... | normalized |
| 87 | application_data | HOUSETYPE_MODE | Normalized information about building where th... | normalized |
| 88 | application_data | TOTALAREA_MODE | Normalized information about building where th... | normalized |
| 89 | application_data | WALLSMATERIAL_MODE | Normalized information about building where th... | normalized |
| 90 | application_data | EMERGENCYSTATE_MODE | Normalized information about building where th... | normalized |
| 91 | application_data | OBS_30_CNT_SOCIAL_CIRCLE | How many observation of client's social surrou... | NaN |
| 92 | application_data | DEF_30_CNT_SOCIAL_CIRCLE | How many observation of client's social surrou... | NaN |
| 93 | application_data | OBS_60_CNT_SOCIAL_CIRCLE | How many observation of client's social surrou... | NaN |
| 94 | application_data | DEF_60_CNT_SOCIAL_CIRCLE | How many observation of client's social surrou... | NaN |
| 95 | application_data | DAYS_LAST_PHONE_CHANGE | How many days before application did client ch... | NaN |
| 96 | application_data | FLAG_DOCUMENT_2 | Did client provide document 2 | NaN |
| 97 | application_data | FLAG_DOCUMENT_3 | Did client provide document 3 | NaN |
| 98 | application_data | FLAG_DOCUMENT_4 | Did client provide document 4 | NaN |
| 99 | application_data | FLAG_DOCUMENT_5 | Did client provide document 5 | NaN |
| 100 | application_data | FLAG_DOCUMENT_6 | Did client provide document 6 | NaN |
| 101 | application_data | FLAG_DOCUMENT_7 | Did client provide document 7 | NaN |
| 102 | application_data | FLAG_DOCUMENT_8 | Did client provide document 8 | NaN |
| 103 | application_data | FLAG_DOCUMENT_9 | Did client provide document 9 | NaN |
| 104 | application_data | FLAG_DOCUMENT_10 | Did client provide document 10 | NaN |
| 105 | application_data | FLAG_DOCUMENT_11 | Did client provide document 11 | NaN |
| 106 | application_data | FLAG_DOCUMENT_12 | Did client provide document 12 | NaN |
| 107 | application_data | FLAG_DOCUMENT_13 | Did client provide document 13 | NaN |
| 108 | application_data | FLAG_DOCUMENT_14 | Did client provide document 14 | NaN |
| 109 | application_data | FLAG_DOCUMENT_15 | Did client provide document 15 | NaN |
| 110 | application_data | FLAG_DOCUMENT_16 | Did client provide document 16 | NaN |
| 111 | application_data | FLAG_DOCUMENT_17 | Did client provide document 17 | NaN |
| 112 | application_data | FLAG_DOCUMENT_18 | Did client provide document 18 | NaN |
| 113 | application_data | FLAG_DOCUMENT_19 | Did client provide document 19 | NaN |
| 114 | application_data | FLAG_DOCUMENT_20 | Did client provide document 20 | NaN |
| 115 | application_data | FLAG_DOCUMENT_21 | Did client provide document 21 | NaN |
| 116 | application_data | AMT_REQ_CREDIT_BUREAU_HOUR | Number of enquiries to Credit Bureau about the... | NaN |
| 117 | application_data | AMT_REQ_CREDIT_BUREAU_DAY | Number of enquiries to Credit Bureau about the... | NaN |
| 118 | application_data | AMT_REQ_CREDIT_BUREAU_WEEK | Number of enquiries to Credit Bureau about the... | NaN |
| 119 | application_data | AMT_REQ_CREDIT_BUREAU_MON | Number of enquiries to Credit Bureau about the... | NaN |
| 120 | application_data | AMT_REQ_CREDIT_BUREAU_QRT | Number of enquiries to Credit Bureau about the... | NaN |
| 121 | application_data | AMT_REQ_CREDIT_BUREAU_YEAR | Number of enquiries to Credit Bureau about the... | NaN |
Vamos a trabajar con 122 variables en el dataset
Creamos una ruta relativa que cogerá los elementos .csv de la carpeta data y leemos el dataset. De esta manera, si alguien descarga en su ordenador el archivo .zip debería ser capaz de leer los .csv de la carpeta data. En este caso nos han pedido que el .zip vaya sin datos para no saturar la nube.
path_folder = "../data/"
pd_loan = pd.read_csv(path_folder +"application_data.csv", low_memory=False)
pd_loan.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
- VARIABLES A FUTURO: Vemos que variables son "variables a futuro". Esto lo podemos observar en el excel adicional donde nos explica para que funciona cada columna y podemos recogerlas para despues eliminarlas, ya que todas las variables que recojan valores futuros no son interesantes para el análisis debido a que una persona NUEVA que pide un préstamos no tiene datos anteriores a la toma del préstamo. En este caso no hay variables a futuro observadas.
- VEMOS LAS DIMENSIONES DE LA TABLA
print(len(pd_loan.columns))
print(pd_loan.shape)
122 (307511, 122)
Encontramos que el dataframe tiene unas dimensiones de 307511 filas y 122 columnas, lo que nos dice a su vez que vamos a tratar con un dataset de 122 variables.
Buscamos la variable que hace únicos a todos los registros, funciona como el identificador de las filas. Vemos que esta es 'SK_ID_CURR', esta funciona, como su propio nombre indica, como identificador o ID, de ahi que salga que tiene tantos valores únicos como filas tiene nuestro dataset.
pd_loan.columns
pd_loan['SK_ID_CURR'].nunique()
307511
- VER LOS TIPOS DE VARIABLES
pd_loan.dtypes.to_dict()
{'SK_ID_CURR': dtype('int64'),
'TARGET': dtype('int64'),
'NAME_CONTRACT_TYPE': dtype('O'),
'CODE_GENDER': dtype('O'),
'FLAG_OWN_CAR': dtype('O'),
'FLAG_OWN_REALTY': dtype('O'),
'CNT_CHILDREN': dtype('int64'),
'AMT_INCOME_TOTAL': dtype('float64'),
'AMT_CREDIT': dtype('float64'),
'AMT_ANNUITY': dtype('float64'),
'AMT_GOODS_PRICE': dtype('float64'),
'NAME_TYPE_SUITE': dtype('O'),
'NAME_INCOME_TYPE': dtype('O'),
'NAME_EDUCATION_TYPE': dtype('O'),
'NAME_FAMILY_STATUS': dtype('O'),
'NAME_HOUSING_TYPE': dtype('O'),
'REGION_POPULATION_RELATIVE': dtype('float64'),
'DAYS_BIRTH': dtype('int64'),
'DAYS_EMPLOYED': dtype('int64'),
'DAYS_REGISTRATION': dtype('float64'),
'DAYS_ID_PUBLISH': dtype('int64'),
'OWN_CAR_AGE': dtype('float64'),
'FLAG_MOBIL': dtype('int64'),
'FLAG_EMP_PHONE': dtype('int64'),
'FLAG_WORK_PHONE': dtype('int64'),
'FLAG_CONT_MOBILE': dtype('int64'),
'FLAG_PHONE': dtype('int64'),
'FLAG_EMAIL': dtype('int64'),
'OCCUPATION_TYPE': dtype('O'),
'CNT_FAM_MEMBERS': dtype('float64'),
'REGION_RATING_CLIENT': dtype('int64'),
'REGION_RATING_CLIENT_W_CITY': dtype('int64'),
'WEEKDAY_APPR_PROCESS_START': dtype('O'),
'HOUR_APPR_PROCESS_START': dtype('int64'),
'REG_REGION_NOT_LIVE_REGION': dtype('int64'),
'REG_REGION_NOT_WORK_REGION': dtype('int64'),
'LIVE_REGION_NOT_WORK_REGION': dtype('int64'),
'REG_CITY_NOT_LIVE_CITY': dtype('int64'),
'REG_CITY_NOT_WORK_CITY': dtype('int64'),
'LIVE_CITY_NOT_WORK_CITY': dtype('int64'),
'ORGANIZATION_TYPE': dtype('O'),
'EXT_SOURCE_1': dtype('float64'),
'EXT_SOURCE_2': dtype('float64'),
'EXT_SOURCE_3': dtype('float64'),
'APARTMENTS_AVG': dtype('float64'),
'BASEMENTAREA_AVG': dtype('float64'),
'YEARS_BEGINEXPLUATATION_AVG': dtype('float64'),
'YEARS_BUILD_AVG': dtype('float64'),
'COMMONAREA_AVG': dtype('float64'),
'ELEVATORS_AVG': dtype('float64'),
'ENTRANCES_AVG': dtype('float64'),
'FLOORSMAX_AVG': dtype('float64'),
'FLOORSMIN_AVG': dtype('float64'),
'LANDAREA_AVG': dtype('float64'),
'LIVINGAPARTMENTS_AVG': dtype('float64'),
'LIVINGAREA_AVG': dtype('float64'),
'NONLIVINGAPARTMENTS_AVG': dtype('float64'),
'NONLIVINGAREA_AVG': dtype('float64'),
'APARTMENTS_MODE': dtype('float64'),
'BASEMENTAREA_MODE': dtype('float64'),
'YEARS_BEGINEXPLUATATION_MODE': dtype('float64'),
'YEARS_BUILD_MODE': dtype('float64'),
'COMMONAREA_MODE': dtype('float64'),
'ELEVATORS_MODE': dtype('float64'),
'ENTRANCES_MODE': dtype('float64'),
'FLOORSMAX_MODE': dtype('float64'),
'FLOORSMIN_MODE': dtype('float64'),
'LANDAREA_MODE': dtype('float64'),
'LIVINGAPARTMENTS_MODE': dtype('float64'),
'LIVINGAREA_MODE': dtype('float64'),
'NONLIVINGAPARTMENTS_MODE': dtype('float64'),
'NONLIVINGAREA_MODE': dtype('float64'),
'APARTMENTS_MEDI': dtype('float64'),
'BASEMENTAREA_MEDI': dtype('float64'),
'YEARS_BEGINEXPLUATATION_MEDI': dtype('float64'),
'YEARS_BUILD_MEDI': dtype('float64'),
'COMMONAREA_MEDI': dtype('float64'),
'ELEVATORS_MEDI': dtype('float64'),
'ENTRANCES_MEDI': dtype('float64'),
'FLOORSMAX_MEDI': dtype('float64'),
'FLOORSMIN_MEDI': dtype('float64'),
'LANDAREA_MEDI': dtype('float64'),
'LIVINGAPARTMENTS_MEDI': dtype('float64'),
'LIVINGAREA_MEDI': dtype('float64'),
'NONLIVINGAPARTMENTS_MEDI': dtype('float64'),
'NONLIVINGAREA_MEDI': dtype('float64'),
'FONDKAPREMONT_MODE': dtype('O'),
'HOUSETYPE_MODE': dtype('O'),
'TOTALAREA_MODE': dtype('float64'),
'WALLSMATERIAL_MODE': dtype('O'),
'EMERGENCYSTATE_MODE': dtype('O'),
'OBS_30_CNT_SOCIAL_CIRCLE': dtype('float64'),
'DEF_30_CNT_SOCIAL_CIRCLE': dtype('float64'),
'OBS_60_CNT_SOCIAL_CIRCLE': dtype('float64'),
'DEF_60_CNT_SOCIAL_CIRCLE': dtype('float64'),
'DAYS_LAST_PHONE_CHANGE': dtype('float64'),
'FLAG_DOCUMENT_2': dtype('int64'),
'FLAG_DOCUMENT_3': dtype('int64'),
'FLAG_DOCUMENT_4': dtype('int64'),
'FLAG_DOCUMENT_5': dtype('int64'),
'FLAG_DOCUMENT_6': dtype('int64'),
'FLAG_DOCUMENT_7': dtype('int64'),
'FLAG_DOCUMENT_8': dtype('int64'),
'FLAG_DOCUMENT_9': dtype('int64'),
'FLAG_DOCUMENT_10': dtype('int64'),
'FLAG_DOCUMENT_11': dtype('int64'),
'FLAG_DOCUMENT_12': dtype('int64'),
'FLAG_DOCUMENT_13': dtype('int64'),
'FLAG_DOCUMENT_14': dtype('int64'),
'FLAG_DOCUMENT_15': dtype('int64'),
'FLAG_DOCUMENT_16': dtype('int64'),
'FLAG_DOCUMENT_17': dtype('int64'),
'FLAG_DOCUMENT_18': dtype('int64'),
'FLAG_DOCUMENT_19': dtype('int64'),
'FLAG_DOCUMENT_20': dtype('int64'),
'FLAG_DOCUMENT_21': dtype('int64'),
'AMT_REQ_CREDIT_BUREAU_HOUR': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_DAY': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_WEEK': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_MON': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_QRT': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_YEAR': dtype('float64')}
Aparentemente vamos a tratar con 3 tipos de variables, continuas ('float64', 'int64'), categóricas ('category') y booleanas ('0') que son aquellas que solo aceptan 2 tipos de valores, dichos valores suelen significar valores dicotomicos. EJ. FLAG_DOCUMENT_2 es una variable que recoge si el documento 2 fue entregado por el tomador del préstamo, esta variable solo toma el valor 0, que asumiremos que es entregado, y 1, que asumiremos que es no entregado. También se utilizaba antiguamente a menudo para variables de género siendo, por ejemplo, 0 = Masculino, 1 = Feminino.
#Returns us in categorical variables (From 'TARGET' to 'FLAG_DOCUMENT_21') and in Others, which do not fulfill the function ('SK_ID_CURR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH').
#The total number of categorical variables is 53 as it tells us at the end.
f_aux.dame_variables_categoricas(pd_loan)
(['TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'], ['SK_ID_CURR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH'], 53)
#Returns us in continuous variables (From 'AMT_INCOME_TOTAL') and in Others, which do not fulfill the function (From 'DEF_60_CNT_SOCIAL_CIRCLE' to 'AMT_REQ_CREDIT_BUREAU_WEEK).
#The total number of categorical variables is 61 as it tells us at the end.
f_aux.dame_variables_continuas(pd_loan)
(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'], ['DEF_60_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK'], 61)
Comprobamos que todas las variables se estan tomando en cuenta de la siguiente forma, el resultado de dame_variables_categóricas es 53 variables categóricas y 4 others (57) y el resultado de dame_variables_continuas que es de 61 variables continuas y 4 others (65) el resultado es 65+57 = 122 variables. Las variables booleanas comentadas anteriormente se introducen en continuas ya que al estar conformadas por 0s y 1s los toma como una variable integer.
Estudiamos y analizamos la variable objetivo, que como hemos dicho anteriormente, es 'TARGET'
print(pd_loan['TARGET'].unique())
#We see how it only returns 0 and 1, in the additional excel we are told that 1= client with payment difficulties: has had a delay in payment of more than X days in at least one of the first Y installments of the loan.
#in our sample and 0 = other cases (we assume no difficulties).
[1 0]
pd_plot_loan_status = pd_loan['TARGET']\
.value_counts(normalize=True)\
.mul(100)\
.rename('percent')\
.reset_index()
pd_plot_loan_status.columns = ['TARGET', 'percent'] # Rename the column 'index' to 'TARGET'.
pd_plot_loan_status_conteo = pd_loan['TARGET'].value_counts().reset_index()
pd_plot_loan_status_conteo.columns = ['TARGET', 'count'] # Rename the count column
pd_plot_loan_status_pc = pd.merge(pd_plot_loan_status, pd_plot_loan_status_conteo, on='TARGET', how='inner')
pd_plot_loan_status_pc
| TARGET | percent | count | |
|---|---|---|---|
| 0 | 0 | 91.927118 | 282686 |
| 1 | 1 | 8.072882 | 24825 |
Obtenemos por lo tanto que, de nuestra muestra, casi un 92% (282686) de las personas no tuvieron dificultades y aproximadamente un 8% (24825) si que las tuvieron.
fig = px.histogram(pd_plot_loan_status_pc, x="TARGET", y='percent', title='Clients distribution')
fig.show()
Gráficamos la distribución, en caso de tener más de dos valores en la variable como pudieran ser préstamos que aun no han finalizado u otras valores que no son interesantes en nuestro análisis se eliminarian y nos quedariamos con las que realmente necesitamos.
A continuación vemos si se eliminan valores missing y/o nulos y se crean las listas correspondientes a aquellas filas/columnas con nulos con las que trabajaremos de aquí en adelante
pd_series_null_columns = pd_loan.isnull().sum().sort_values(ascending=False)
pd_series_null_rows = pd_loan.isnull().sum(axis=1).sort_values(ascending=False)
print(pd_series_null_columns.shape, pd_series_null_rows.shape)
pd_null_columns = pd.DataFrame(pd_series_null_columns, columns=['columns_nuls'])
pd_null_rows = pd.DataFrame(pd_series_null_rows, columns=['rows_nulls'])
pd_null_rows['TARGET'] = pd_loan['TARGET'].copy()
pd_null_columns['columns_percentage'] = pd_null_columns['columns_nuls']/pd_loan.shape[0]
pd_null_rows['rows_percentage']= pd_null_rows['rows_nulls']/pd_loan.shape[1]
(122,) (307511,)
VEMOS QUE NÚMERO DE NULOS HAY POR COLUMNA Y COMO SE DISTRIBUYEN EN PORCENTAJE
pd_null_columns
| columns_nuls | columns_percentage | |
|---|---|---|
| COMMONAREA_MEDI | 214865 | 0.698723 |
| COMMONAREA_AVG | 214865 | 0.698723 |
| COMMONAREA_MODE | 214865 | 0.698723 |
| NONLIVINGAPARTMENTS_MODE | 213514 | 0.694330 |
| NONLIVINGAPARTMENTS_AVG | 213514 | 0.694330 |
| NONLIVINGAPARTMENTS_MEDI | 213514 | 0.694330 |
| FONDKAPREMONT_MODE | 210295 | 0.683862 |
| LIVINGAPARTMENTS_MODE | 210199 | 0.683550 |
| LIVINGAPARTMENTS_AVG | 210199 | 0.683550 |
| LIVINGAPARTMENTS_MEDI | 210199 | 0.683550 |
| FLOORSMIN_AVG | 208642 | 0.678486 |
| FLOORSMIN_MODE | 208642 | 0.678486 |
| FLOORSMIN_MEDI | 208642 | 0.678486 |
| YEARS_BUILD_MEDI | 204488 | 0.664978 |
| YEARS_BUILD_MODE | 204488 | 0.664978 |
| YEARS_BUILD_AVG | 204488 | 0.664978 |
| OWN_CAR_AGE | 202929 | 0.659908 |
| LANDAREA_MEDI | 182590 | 0.593767 |
| LANDAREA_MODE | 182590 | 0.593767 |
| LANDAREA_AVG | 182590 | 0.593767 |
| BASEMENTAREA_MEDI | 179943 | 0.585160 |
| BASEMENTAREA_AVG | 179943 | 0.585160 |
| BASEMENTAREA_MODE | 179943 | 0.585160 |
| EXT_SOURCE_1 | 173378 | 0.563811 |
| NONLIVINGAREA_MODE | 169682 | 0.551792 |
| NONLIVINGAREA_AVG | 169682 | 0.551792 |
| NONLIVINGAREA_MEDI | 169682 | 0.551792 |
| ELEVATORS_MEDI | 163891 | 0.532960 |
| ELEVATORS_AVG | 163891 | 0.532960 |
| ELEVATORS_MODE | 163891 | 0.532960 |
| WALLSMATERIAL_MODE | 156341 | 0.508408 |
| APARTMENTS_MEDI | 156061 | 0.507497 |
| APARTMENTS_AVG | 156061 | 0.507497 |
| APARTMENTS_MODE | 156061 | 0.507497 |
| ENTRANCES_MEDI | 154828 | 0.503488 |
| ENTRANCES_AVG | 154828 | 0.503488 |
| ENTRANCES_MODE | 154828 | 0.503488 |
| LIVINGAREA_AVG | 154350 | 0.501933 |
| LIVINGAREA_MODE | 154350 | 0.501933 |
| LIVINGAREA_MEDI | 154350 | 0.501933 |
| HOUSETYPE_MODE | 154297 | 0.501761 |
| FLOORSMAX_MODE | 153020 | 0.497608 |
| FLOORSMAX_MEDI | 153020 | 0.497608 |
| FLOORSMAX_AVG | 153020 | 0.497608 |
| YEARS_BEGINEXPLUATATION_MODE | 150007 | 0.487810 |
| YEARS_BEGINEXPLUATATION_MEDI | 150007 | 0.487810 |
| YEARS_BEGINEXPLUATATION_AVG | 150007 | 0.487810 |
| TOTALAREA_MODE | 148431 | 0.482685 |
| EMERGENCYSTATE_MODE | 145755 | 0.473983 |
| OCCUPATION_TYPE | 96391 | 0.313455 |
| EXT_SOURCE_3 | 60965 | 0.198253 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_DAY | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_MON | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_QRT | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 41519 | 0.135016 |
| NAME_TYPE_SUITE | 1292 | 0.004201 |
| OBS_30_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| DEF_30_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| OBS_60_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| EXT_SOURCE_2 | 660 | 0.002146 |
| AMT_GOODS_PRICE | 278 | 0.000904 |
| AMT_ANNUITY | 12 | 0.000039 |
| CNT_FAM_MEMBERS | 2 | 0.000007 |
| DAYS_LAST_PHONE_CHANGE | 1 | 0.000003 |
| CNT_CHILDREN | 0 | 0.000000 |
| FLAG_DOCUMENT_8 | 0 | 0.000000 |
| NAME_CONTRACT_TYPE | 0 | 0.000000 |
| CODE_GENDER | 0 | 0.000000 |
| FLAG_OWN_CAR | 0 | 0.000000 |
| FLAG_DOCUMENT_2 | 0 | 0.000000 |
| FLAG_DOCUMENT_3 | 0 | 0.000000 |
| FLAG_DOCUMENT_4 | 0 | 0.000000 |
| FLAG_DOCUMENT_5 | 0 | 0.000000 |
| FLAG_DOCUMENT_6 | 0 | 0.000000 |
| FLAG_DOCUMENT_7 | 0 | 0.000000 |
| FLAG_DOCUMENT_9 | 0 | 0.000000 |
| FLAG_DOCUMENT_21 | 0 | 0.000000 |
| FLAG_DOCUMENT_10 | 0 | 0.000000 |
| FLAG_DOCUMENT_11 | 0 | 0.000000 |
| FLAG_OWN_REALTY | 0 | 0.000000 |
| FLAG_DOCUMENT_13 | 0 | 0.000000 |
| FLAG_DOCUMENT_14 | 0 | 0.000000 |
| FLAG_DOCUMENT_15 | 0 | 0.000000 |
| FLAG_DOCUMENT_16 | 0 | 0.000000 |
| FLAG_DOCUMENT_17 | 0 | 0.000000 |
| FLAG_DOCUMENT_18 | 0 | 0.000000 |
| FLAG_DOCUMENT_19 | 0 | 0.000000 |
| FLAG_DOCUMENT_20 | 0 | 0.000000 |
| FLAG_DOCUMENT_12 | 0 | 0.000000 |
| AMT_CREDIT | 0 | 0.000000 |
| AMT_INCOME_TOTAL | 0 | 0.000000 |
| FLAG_PHONE | 0 | 0.000000 |
| LIVE_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| REG_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| TARGET | 0 | 0.000000 |
| REG_CITY_NOT_LIVE_CITY | 0 | 0.000000 |
| LIVE_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| REG_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| REG_REGION_NOT_LIVE_REGION | 0 | 0.000000 |
| HOUR_APPR_PROCESS_START | 0 | 0.000000 |
| WEEKDAY_APPR_PROCESS_START | 0 | 0.000000 |
| REGION_RATING_CLIENT_W_CITY | 0 | 0.000000 |
| REGION_RATING_CLIENT | 0 | 0.000000 |
| FLAG_EMAIL | 0 | 0.000000 |
| FLAG_CONT_MOBILE | 0 | 0.000000 |
| ORGANIZATION_TYPE | 0 | 0.000000 |
| FLAG_WORK_PHONE | 0 | 0.000000 |
| FLAG_EMP_PHONE | 0 | 0.000000 |
| FLAG_MOBIL | 0 | 0.000000 |
| DAYS_ID_PUBLISH | 0 | 0.000000 |
| DAYS_REGISTRATION | 0 | 0.000000 |
| DAYS_EMPLOYED | 0 | 0.000000 |
| DAYS_BIRTH | 0 | 0.000000 |
| REGION_POPULATION_RELATIVE | 0 | 0.000000 |
| NAME_HOUSING_TYPE | 0 | 0.000000 |
| NAME_FAMILY_STATUS | 0 | 0.000000 |
| NAME_EDUCATION_TYPE | 0 | 0.000000 |
| NAME_INCOME_TYPE | 0 | 0.000000 |
| SK_ID_CURR | 0 | 0.000000 |
CREO UN DATA SET LLAMADO pd_loan_filter_null en el cual serán eliminadas todas aquellas variables con un % de nulos > al valor de threshold, en este caso 0.9, como ninguna variable supera dicho umbral, el dataset tiene la misma forma que el original con el que estaba trabajando. De esta manera podemos obtener un análisis en el que más variables entren en juego ya que con, por ejemplo, un threshold de 0.6 se eliminarían variables como OWN_CAR_AGE (Años de vida del coche), COMMONAREA_X (Información normalizada acerca de donde vive el cliente), cosas que, a criterio personal en esta práctica, pueden aportar valor al análisis.
threshold=0.9
list_vars_not_null = list(pd_null_columns[pd_null_columns['columns_percentage']<threshold].index)
pd_loan_filter_null = pd_loan.loc[:, list_vars_not_null]
pd_loan_filter_null.shape
(307511, 122)
VARIABLES CATEGÓRICAS
list_var_cat, othercat,num_cat = f_aux.dame_variables_categoricas(dataset=pd_loan_filter_null)
pd_loan_filter_null[list_var_cat] = pd_loan_filter_null[list_var_cat].astype("category")
list_var_cat
['FONDKAPREMONT_MODE', 'WALLSMATERIAL_MODE', 'HOUSETYPE_MODE', 'EMERGENCYSTATE_MODE', 'OCCUPATION_TYPE', 'NAME_TYPE_SUITE', 'CNT_CHILDREN', 'FLAG_DOCUMENT_8', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_21', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_OWN_REALTY', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_12', 'FLAG_PHONE', 'LIVE_CITY_NOT_WORK_CITY', 'REG_CITY_NOT_WORK_CITY', 'TARGET', 'REG_CITY_NOT_LIVE_CITY', 'LIVE_REGION_NOT_WORK_REGION', 'REG_REGION_NOT_WORK_REGION', 'REG_REGION_NOT_LIVE_REGION', 'HOUR_APPR_PROCESS_START', 'WEEKDAY_APPR_PROCESS_START', 'REGION_RATING_CLIENT_W_CITY', 'REGION_RATING_CLIENT', 'FLAG_EMAIL', 'FLAG_CONT_MOBILE', 'ORGANIZATION_TYPE', 'FLAG_WORK_PHONE', 'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'NAME_HOUSING_TYPE', 'NAME_FAMILY_STATUS', 'NAME_EDUCATION_TYPE', 'NAME_INCOME_TYPE']
Comprobamos que, ciertamente, todas las variables de la lista 'list_cat_vars' son categóricas/'category'
unique_dtypes_cat = pd_loan_filter_null[list_var_cat].dtypes.apply(lambda x: x.name).unique()
print("Unique data types in list_var_cat:", unique_dtypes_cat)
Unique data types in list_var_cat: ['category']
VARIABLES NUMÉRICAS
list_var_cont, othercont, num_cont = f_aux.dame_variables_continuas(dataset=pd_loan_filter_null)
pd_loan_filter_null[list_var_cont] = pd_loan_filter_null[list_var_cont].astype("float64")
list_var_cont
['COMMONAREA_MEDI', 'COMMONAREA_AVG', 'COMMONAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MEDI', 'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_AVG', 'LIVINGAPARTMENTS_MEDI', 'FLOORSMIN_AVG', 'FLOORSMIN_MODE', 'FLOORSMIN_MEDI', 'YEARS_BUILD_MEDI', 'YEARS_BUILD_MODE', 'YEARS_BUILD_AVG', 'OWN_CAR_AGE', 'LANDAREA_MEDI', 'LANDAREA_MODE', 'LANDAREA_AVG', 'BASEMENTAREA_MEDI', 'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE', 'EXT_SOURCE_1', 'NONLIVINGAREA_MODE', 'NONLIVINGAREA_AVG', 'NONLIVINGAREA_MEDI', 'ELEVATORS_MEDI', 'ELEVATORS_AVG', 'ELEVATORS_MODE', 'APARTMENTS_MEDI', 'APARTMENTS_AVG', 'APARTMENTS_MODE', 'ENTRANCES_MEDI', 'ENTRANCES_AVG', 'ENTRANCES_MODE', 'LIVINGAREA_AVG', 'LIVINGAREA_MODE', 'LIVINGAREA_MEDI', 'FLOORSMAX_MODE', 'FLOORSMAX_MEDI', 'FLOORSMAX_AVG', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BEGINEXPLUATATION_AVG', 'TOTALAREA_MODE', 'EXT_SOURCE_3', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'EXT_SOURCE_2', 'AMT_GOODS_PRICE', 'AMT_ANNUITY', 'CNT_FAM_MEMBERS', 'DAYS_LAST_PHONE_CHANGE', 'AMT_CREDIT', 'AMT_INCOME_TOTAL', 'DAYS_REGISTRATION', 'REGION_POPULATION_RELATIVE']
pd_loan_filter_null[list_var_cont].dtypes
unique_dtypes = set(pd_loan_filter_null[list_var_cont].dtypes)
print(unique_dtypes)
{dtype('float64')}
Al igual que con las variables categóricas, comprobamos que en la lista de variables continuas solo hay variables continuas tal que 'float64'/'int64', en este caso 'float64'
Al igual que en la plantilla HTML, voy a tratar algunas variables que pueden ser interesante conocer datos a priori de ellas
A través del siguiente código, vemos el tipo de contrato que tiene el cliente diferenciando entre CASH LOANS (PRÉSTAMOS EN EFECTIVO DE UNA CANTIDAD FIJA LA CUAL SE ENTREGA AL MOMENTO DE PEDIRLO) y REVOLVING LOANS (PRÉSTAMOS MÁS FLEXIBLES EN LOS QUE SE ABRE UNA LINEA DE CRÉDITO CON UN LIMITE PREESTABLECIDO DE LA CUAL SE PUEDE RETIRAR DINERO HASTA DICHO LÍMITE) y vemos como se distribuye en 90,48% aproximadamente pidieron un 'Cash Loan' y el 9,52% restante solicitaron 'Revolving loans'
pd_loan_filter_null['NAME_CONTRACT_TYPE'].value_counts()
NAME_CONTRACT_TYPE Cash loans 278232 Revolving loans 29279 Name: count, dtype: int64
fig = px.histogram(
pd_loan_filter_null,
x="NAME_CONTRACT_TYPE",
title='Credit distribution by contract type',
histnorm='percent', # Convert counts to percentage
labels={"NAME_CONTRACT_TYPE": "Contract type"} #AXIS X
)
#AXIS Y
fig.update_yaxes(title_text="Percentage of the poblation")
fig.show()
Podría ser interesante ver también el género de los tomadores del préstamo en este caso, casi 2/3 de la muestra son de género femenino y 1/3 masculino.
pd_loan_filter_null['CODE_GENDER'].value_counts()
CODE_GENDER F 202448 M 105059 XNA 4 Name: count, dtype: int64
A través del siguiente código vemos, en porcentaje, la fuente de ingresos de los clientes
print(pd_loan_filter_null["NAME_INCOME_TYPE"].value_counts().count())
pd_loan_filter_null["NAME_INCOME_TYPE"]\
.apply(lambda x: str(x).lower().strip()).value_counts(normalize=True)#.count()
8
NAME_INCOME_TYPE working 0.516320 commercial associate 0.232892 pensioner 0.180033 state servant 0.070576 unemployed 0.000072 student 0.000059 businessman 0.000033 maternity leave 0.000016 Name: proportion, dtype: float64
Más del 50% se encuentran como 'Working' o trabajando, esta variable puede ser engañosa por qué realmente otros valores como commercial associate, state servant o businessman también contaría como 'working'. Podría ser interesante de cara al futuro simplemente dividir entre aquellos que trabajan (working, commercial,businessman,state servant) y aquellos que, entiendo, no trabajan (pensioner,unemployed,student,martenity leave) para un análisis entre poblacion activa y no activa.
A través de sumar la lista que nos ha devuelto los 'others' para los 2 tipos de variables obtenemos las 8 variables que no son ni categóricas ni continuas
other_variables = othercat + othercont
other_variables
['DAYS_ID_PUBLISH', 'DAYS_EMPLOYED', 'DAYS_BIRTH', 'SK_ID_CURR', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'DEF_60_CNT_SOCIAL_CIRCLE']
Guardamos el dataframe pd_loan_filter_null a .csv para poder continuar trabajando con el en otro notebook y no realizar notebooks excesivamente largos
pd_loan_filter_null.to_csv("../data/pd_data_initial_preprocessing.csv")